Here, we will evaluate methods for deriving TWAS SNP-weights based on eQTL summary statistics.
There are two stages to this process:
Existing approaches:
The FUSION and MetaXcan approach is better for determining genes to include in a GWAS as it better reflects the power that TWAS would have. We may not always have a target sample to test the variance explained by the SNP-weights, so the MetaXcan approach is less applicable. So, I think we should use the same criteria as FUSION, except we must use a summary statistic based approach to estimate SNP-based heritability. We should compare summary statistic approaches to estimates from GREML as reported in FUSION released SNP-weights. The approach also needs to be fast as it will need to be implemented for each gene seperately.
SMR uses single eQTL summary statistics, making it applicable to a wider range of datasets and results from meta-analyses. TWAS based methods are not currently relevent when only eQTL summary statistics are available, as the advantage of these methods over SMR is their ability to incorporate the effects of multiple SNPs on gene expression. Currently, TWAS methods use SNP-weights derived using individual level data, which means TWAS cannot use meta-analysis results for eQTLs, and often TWAS weights are unavailable for the latest eQTL datasets. So, here we will compare a range of summary statistic approaches to derive TWAS weights. Again, the method will need to be fast since it will need to be applied to each genes seperately. Summary statistic based polygenic scoring methods will likely be of use, though given the lack of target samples for validation, a pseudovalidation approach will be necessary, which estimates the optimal parameters without a target sample for validation.
Possible methods for deriving SNP-weights:
Comparison to FUSION weights is not ideal as by chance different leads SNP could be selected. Ideally we would compare summary statistic-based and observed TWAS weights to observed expression. However, lets start with this as it will allow comparison of heritability estimates and be a good opportunity to design the study.
We will use whole blood data for the comparison. eQTL summary statistics are downloaded from here. FUSION TWAS SNP-weights were downloaded from here.
We will compare different version of GCTB Bayes models and LDSC. If LDSC doesn’t work, then LDAK model is unlikely to make a difference. To avoid unessecary compuational burden, only estimate hsq for 100 genes on chromosome 22.
Show code
# Split the GTEx eQTL data by chromosome
for chr in $(seq 1 22);do
pattern=$(echo ^${chr}_)
awk -v pat="$pattern" '$2 ~ pat { print }' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
cat <(head -n1 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt
rm /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
done
library(data.table)
# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/RDat_files')
# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')
sbayesr_h2<-NULL
sbayesr_robust_h2<-NULL
sbayess_h2<-NULL
ldsc_h2<-NULL
# Run across each chromosome seperately
for(chr_i in 22){
# Read in eQTL data
eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
# Create CHR, BP, A1, A2, and Build columns
var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
names(var_id)<-c('CHR','BP','A1','A2','Build')
var_id$CHR<-as.numeric(var_id$CHR)
var_id$BP<-as.numeric(var_id$BP)
eqtl<-cbind(eqtl, var_id)
# Insert IUPAC codes for SNPs
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
# Remove INDELS
eqtl<-eqtl[!is.na(eqtl$IUPAC),]
# Calculate N for each association
eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
# Check the build
table(eqtl$Build)
# They are all b37
###
# Insert RSIDs
###
# Read in 1KG reference SNP data
ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
# Extract hm3 SNPs
ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
# Insert IUPAC codes
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'
# Merge target and reference based on position
ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
# I have used the GTEx reference data as well, and the same number is returned.
# Note: The A1 allele is the ref allele.
# Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
# Identify SNPs for which alleles need to be flipped
flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
# Idenitfy SNPs which match the reference alleles
incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
# If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
# Flip alleles where necessary
flip_tmp$A1_flipped<-flip_tmp$A1.x
flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
flip_tmp$A1.x<-flip_tmp$A1_flipped
flip_tmp$A1_flipped<-NULL
flip_tmp$A2_flipped<-flip_tmp$A2.x
flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
flip_tmp$A2.x<-flip_tmp$A2_flipped
flip_tmp$A2_flipped<-NULL
# Recombine and format
eqtl_harm<-rbind(flip_tmp, incl)
eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'
# Remove variants with MAF < 1%
eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
eqtl<-eqtl_harm
rm(eqtl_harm)
# Identify unique genes
genes<-unique(eqtl$gene_id)
# Subset the first 100 genes
genes<-genes[1:100]
# Run for each gene seperately
for(gene_i in genes){
print(which(genes == gene_i))
eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
#######
# SBayesR
#######
# Format for SBayesR
eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
# write in SBayesR format
fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run SBayesR
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR'), intern=T)
# Run SBayesR
log2<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --robust --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust'), intern=T)
#######
# SBayesS
#######
# Run SBayesS
log3<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes S --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --exclude-mhc --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS'), intern=T)
#######
# LDSC
#######
# Format for LDSC
eqtl_gene_i_ldsc<-eqtl_gene_i[,c('SNP','A1','A2','slope','slope_se','N'), with=F]
names(eqtl_gene_i_ldsc)<-c('SNP','A1','A2','BETA','SE','N')
eqtl_gene_i_ldsc$Z<-eqtl_gene_i_ldsc$BETA/eqtl_gene_i_ldsc$SE
eqtl_gene_i_ldsc<-eqtl_gene_i_ldsc[,c('SNP','A1','A2','Z','N'), with=F]
# write in LDSC format
fwrite(eqtl_gene_i_ldsc, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run LDSC
log4<-system(paste0('/users/k1806347/brc_scratch/Software/ldsc.sh --h2 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt --ref-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --w-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2'), intern=T)
########
# Tabulate SNP-h2 results
########
# Read SbayesR heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
}
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_robust_h2<-rbind(sbayesr_robust_h2, par_res_file_i)
}
# Read SbayesS heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayess_h2<-rbind(sbayess_h2, par_res_file_i)
}
# Read LDSC heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))){
par_res_file_i<-readLines(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))
par_res_file_i<-gsub('.*: ','',par_res_file_i[grepl('Total Observed scale h2',par_res_file_i)])
hsq<-as.numeric(gsub(' .*','',par_res_file_i))
se<-as.numeric(gsub(")",'',gsub(".*\\(",'',par_res_file_i)))
p<-pnorm(hsq/se, lower.tail = F)
ldsc_h2<-rbind(ldsc_h2, data.frame(gene=gene_i,
hsq=hsq,
se=se,
p=p))
}
}
}
# Compare hsq estimates across methods
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))
fusion_files<-intersect(fusion_files, genes)
greml_h2<-NULL
for(genes_i in fusion_files){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
fusion<-data.frame(wgt.matrix)
fusion$SNP<-row.names(fusion)
greml<-c(hsq, hsq.pv)
greml_h2<-rbind(greml_h2, data.frame(gene=genes_i,
hsq=greml[1],
se=greml[2],
p=greml[3]))
}
# Combine results across methods
sbayesr_h2$method<-'sbayesr'
sbayesr_robust_h2$method<-'sbayesr_robust'
sbayess_h2$method<-'sbayess'
ldsc_h2$method<-'ldsc'
greml_h2$method<-'greml'
all_h2<-do.call(rbind, list(sbayesr_h2,
sbayesr_robust_h2,
sbayess_h2,
ldsc_h2,
greml_h2))
write.table(all_h2, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt', col.names=T, row.names=F, quote=F)
# Plot the results
library(data.table)
all_h2<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt')
library(UpSetR)
# Remove estimates which did not converge
all_h2<-all_h2[all_h2$se != 0,]
# Merge estimates an check correlation
all_h2_full_list<-split(all_h2[,c('gene','hsq'),with=F], f = all_h2$method)
for(i in names(all_h2_full_list)){
names(all_h2_full_list[[which(names(all_h2_full_list) == i)]])<-c('gene',i)
}
all_h2_full_dat<-Reduce(function(...) merge(..., by='gene', all=T), all_h2_full_list)
library("ggplot2")
library("GGally")
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red')
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.png', units = 'px', res=300, width=2000, height=2000)
ggpairs(all_h2_full_dat[,-1], lower = list(continuous = wrap(lowerfun)))
dev.off()
# Note. LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn't mean it is inaccurate as we don't know the truth.
# Check the number of genes that are significant in each method and how they overlap across methods.
# Extract significant estimates
all_h2_sig<-all_h2[all_h2$p < 0.01,]
n_gene<-data.frame(table(all_h2_sig$method))
names(n_gene)<-c('Method','N_genes')
# The number is similar across methods, but SBayesR has the highest.
all_h2_sig_list<-split(all_h2_sig[['gene']], f = all_h2_sig$method)
# Plot number of genes with valid and significant estimates
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.png', units = 'px', res=300, width=1500, height=1000)
upset(fromList(all_h2_sig_list), nsets=10, order.by = "freq")
dev.off()
# Again LDSC is least concordant with other methods.
# Compare number overlapping with GREML list
all_h2_greml_sig_list<-all_h2_sig_list
for(i in names(all_h2_greml_sig_list)){
tmp<-all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]
tmp<-tmp[tmp %in% all_h2_greml_sig_list[['greml']]]
all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]<-tmp
}
n_gene$GREML_overlap<-unlist(lapply(all_h2_greml_sig_list, length))
write.csv(n_gene, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.csv', row.names=F)
# Note. SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.
Show SNP-h2 estimates across methods
LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn’t mean it is inaccurate as we don’t know the truth.
Show number of genes with significant SNP-h2
| Method | N_genes | GREML_overlap |
|---|---|---|
| greml | 39 | 39 |
| ldsc | 30 | 21 |
| sbayesr | 36 | 33 |
| sbayesr_robust | 33 | 27 |
| sbayess | 35 | 27 |
The number is similar across methods, but SBayesR has the highest.
SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.
Again LDSC is least concordant with other methods.
Now we have decided which genes to create SNP-weights for, we will generate SNP-weights for predicting gene expression using a range of methods. Then we will compare these SNP-weights to those derived by FUSION. Again, to avoid computational burden, start by running for the first 100 genes on chromosome 22.
Show code
library(data.table)
# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files')
# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')
sbayesr_h2<-NULL
# Run across each chromosome seperately
for(chr_i in 22){
# Read in eQTL data
eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
# Create CHR, BP, A1, A2, and Build columns
var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
names(var_id)<-c('CHR','BP','A1','A2','Build')
var_id$CHR<-as.numeric(var_id$CHR)
var_id$BP<-as.numeric(var_id$BP)
eqtl<-cbind(eqtl, var_id)
# Insert IUPAC codes for SNPs
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
# Remove INDELS
eqtl<-eqtl[!is.na(eqtl$IUPAC),]
# Calculate N for each association
eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
# Check the build
table(eqtl$Build)
# They are all b37
###
# Insert RSIDs
###
# Read in 1KG reference SNP data
ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
# Extract hm3 SNPs
ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
# Insert IUPAC codes
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'
# Merge target and reference based on position
ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
# I have used the GTEx reference data as well, and the same number is returned.
# Note: The A1 allele is the ref allele.
# Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
# Identify SNPs for which alleles need to be flipped
flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
# Idenitfy SNPs which match the reference alleles
incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
# If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
# Flip alleles where necessary
flip_tmp$A1_flipped<-flip_tmp$A1.x
flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
flip_tmp$A1.x<-flip_tmp$A1_flipped
flip_tmp$A1_flipped<-NULL
flip_tmp$A2_flipped<-flip_tmp$A2.x
flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
flip_tmp$A2.x<-flip_tmp$A2_flipped
flip_tmp$A2_flipped<-NULL
# Recombine and format
eqtl_harm<-rbind(flip_tmp, incl)
eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'
# Remove variants with MAF < 1%
eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
eqtl<-eqtl_harm
rm(eqtl_harm)
# Identify unique genes
genes<-unique(eqtl$gene_id)
genes<-genes[1:100]
# Run for each gene seperately
for(gene_i in genes){
print(which(genes == gene_i))
eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
# Sort by chromosome and bp
eqtl_gene_i<-eqtl_gene_i[order(eqtl_gene_i$CHR, eqtl_gene_i$BP),]
# Filter SNPs to those with N > 80% of max(N)
eqtl_gene_i<-eqtl_gene_i[eqtl_gene_i$N >= 0.8*max(eqtl_gene_i$N),]
#######
# SBayesR
#######
# Format for SBayesR
eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
# write in SBayesR format
fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run SBayesR
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR'), intern=T)
# Read SbayesR heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
# If SNP-h2 p < 0.01, generate SNP-weights
if(par_res_file_i$p < 0.01){
# Flip the effect of each method to match eqtl sumstats
ref_tmp<-eqtl_gene_i[, c('SNP','A1','A2'), with=F]
#############
# SBayesR
#############
# SBayesR has already been run, so just read in the SNP-weights
sbayesr_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.snpRes'))
sbayesr_score<-sbayesr_score[,c('Name','A1','A2','A1Effect'), with=F]
names(sbayesr_score)<-c('SNP','A1','A2','BETA')
# Flip effects so allele match eQTL sumstats
sbayesr_score<-merge(ref_tmp, sbayesr_score, by='SNP', all=T)
sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]<--sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]
sbayesr_score<-sbayesr_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
names(sbayesr_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
sbayesr_score<-sbayesr_score[match(eqtl_gene_i$SNP, sbayesr_score$SNP),]
#############
# SBayesR robust
#############
# SBayesR format sumstats are already available
# So just run SBayesR with robust setting
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --robust --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR'), intern=T)
# Read in the results
sbayesr_robust_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR.snpRes'))
sbayesr_robust_score<-sbayesr_robust_score[,c('Name','A1','A2','A1Effect'), with=F]
names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')
# Flip effects so allele match eQTL sumstats
sbayesr_robust_score<-merge(ref_tmp, sbayesr_robust_score, by='SNP', all=T)
sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]<--sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]
sbayesr_robust_score<-sbayesr_robust_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
sbayesr_robust_score<-sbayesr_robust_score[match(eqtl_gene_i$SNP, sbayesr_robust_score$SNP),]
#############
# DBSLMM
#############
# Convert to GEMMA format
eqtl_gene_i_dbslmm<-eqtl_gene_i
eqtl_gene_i_dbslmm$N_MISS<-max(eqtl_gene_i_dbslmm$N)-eqtl_gene_i_dbslmm$N
eqtl_gene_i_dbslmm<-eqtl_gene_i_dbslmm[,c('CHR','SNP','BP','N_MISS','N','A1','A2','maf','slope','slope_se','pval_nominal'),with=F]
names(eqtl_gene_i_dbslmm)<-c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald')
# Match allele1 and 0 with A1 and 2 in reference (DBSLMM calls this allele discrepancy)
ref_bim_subset<-fread(paste0('/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,'.bim'))
GWAS_match<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V5','V6'))
GWAS_switch<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V6','V5'))
GWAS_switch$allele_tmp<-GWAS_switch$allele0
GWAS_switch$allele0<-GWAS_switch$allele1
GWAS_switch$allele1<-GWAS_switch$allele_tmp
GWAS_switch$allele_tmp<-NULL
GWAS_switch$beta<--GWAS_switch$beta
GWAS_switch$af<-1-GWAS_switch$af
GWAS<-rbind(GWAS_match, GWAS_switch)
GWAS<-GWAS[order(GWAS$chr, GWAS$ps),]
GWAS<-GWAS[,c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald'),with=F]
GWAS_N<-mean(GWAS$n_obs)
nsnp<-nrow(GWAS)
# Write out formatted sumstats
fwrite(GWAS, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM'), sep='\t', col.names=F)
# Run dbslmm
system('chmod 777 /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm')
system(paste0('/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/DBSLMM/software/DBSLMM.R --plink /users/k1806347/brc_scratch/Software/plink1.9.sh --block /users/k1806347/brc_scratch/Data/LDetect/EUR/fourier_ls-chr',chr_i,'.bed --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm --h2 ',par_res_file_i$hsq,' --ref /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM --n ',round(GWAS_N,0),' --nsnp ',nsnp,' --outPath /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/ --thread 1'))
# Read in the results
dbslmm_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gsub('\\..*','',gene_i),'.dbslmm.txt'))
dbslmm_score<-dbslmm_score[,c(1,2,4), with=T]
names(dbslmm_score)<-c('SNP','A1','BETA')
# Flip effects so allele match eQTL sumstats
dbslmm_score<-merge(ref_tmp, dbslmm_score, by='SNP', all=T)
dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]<--dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]
dbslmm_score<-dbslmm_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(dbslmm_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
dbslmm_score<-dbslmm_score[match(eqtl_gene_i$SNP, dbslmm_score$SNP),]
##############
# PRScs
##############
# Format for PRScs
eqtl_gene_i_prscs<-eqtl_gene_i[,c('SNP','A1','A2','slope','pval_nominal'), with=F]
names(eqtl_gene_i_prscs)<-c('SNP','A1','A2','BETA','P')
# write in PRScs format
fwrite(eqtl_gene_i_prscs, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
system(paste0('/users/k1806347/brc_scratch/Software/PRScs.sh --ref_dir=/users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur --bim_prefix=/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --sst_file=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt --n_gwas=',round(GWAS_N,0),' --out_dir=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,' --chrom=',chr_i))
# Read in the results
prscs_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'_pst_eff_a1_b0.5_phiauto_chr22.txt'))
prscs_score<-prscs_score[,c('V2','V4','V6'), with=F]
names(prscs_score)<-c('SNP','A1','BETA')
# Flip effects so allele match eQTL sumstats
prscs_score<-merge(ref_tmp, prscs_score, by='SNP', all=T)
prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]<--prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]
prscs_score<-prscs_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(prscs_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
prscs_score<-prscs_score[match(eqtl_gene_i$SNP, prscs_score$SNP),]
# Note PRScs takes a lot longer than other methods.
################
# SuSiE finemapping
################
# Read LD estimates for eQTL sumstats
write.table(eqtl_gene_i$SNP, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt'), col.names=F, row.names=F, quote=F)
system(paste0('/users/k1806347/brc_scratch/Software/plink1.9.sh --bfile /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --extract /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt --r square --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i))
ld<-as.matrix(fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'.ld')))
library(susieR)
tryCatch(fitted_rss <- susie_rss(eqtl_gene_i$slope/eqtl_gene_i$slope_se, ld, L = 10), error = function(e){skip_to_next <<- TRUE})
skip_to_next<-F
if(skip_to_next == TRUE){
susie_score<-data.table(SNP=eqtl_gene_i$SNP,
A1=eqtl_gene_i$A1,
BETA=NA)
} else {
susie_score<-data.table(SNP=eqtl_gene_i$SNP,
A1=eqtl_gene_i$A1,
BETA=eqtl_gene_i$slope*fitted_rss$pip)
}
# Flip effects so allele match eQTL sumstats
susie_score<-merge(ref_tmp, susie_score, by='SNP', all=T)
susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]<--susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]
susie_score<-susie_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(susie_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
susie_score<-susie_score[match(eqtl_gene_i$SNP, susie_score$SNP),]
# Create RDat file for FUSION
cv.performance<-as.matrix(data.frame(sbayesr=c(NA,NA),
sbayesr_robust=c(NA,NA),
dbslmm=c(NA,NA),
prscs=c(NA,NA),
susie=c(NA,NA),
top1=c(NA,NA),
row.names=c('rsq','pval')))
# Sort eQTL data into the same order as other score files
eqtl_gene_i<-eqtl_gene_i[match(sbayesr_score$SNP, eqtl_gene_i$SNP),]
hsq<-c(par_res_file_i$hsq, par_res_file_i$se)
hsq.pv<-par_res_file_i$p
N.tot<-max(eqtl_gene_i$N)
eqtl_gene_i$POS<-0
snps<-eqtl_gene_i[,c('CHR','SNP','POS','BP','A1','A2')]
names(snps)<-paste0('V',1:length(names(snps)))
wgt.matrix<-as.matrix(data.frame(sbayesr=sbayesr_score$BETA,
sbayesr_robust=sbayesr_robust_score$BETA,
dbslmm=dbslmm_score$BETA,
prscs=prscs_score$BETA,
susie=susie_score$BETA,
top1=eqtl_gene_i$slope,
row.names=snps$V2))
save(cv.performance,
hsq,
hsq.pv,
N.tot,
snps,
wgt.matrix,
file = paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',gene_i,'.RDat'))
}
}
}
}
library(data.table)
fusion_pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood.pos')
fusion_pos<-fusion_pos[fusion_pos$CHR == 22,]
dim(fusion_pos) # 234
eqtl_derived_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/', pattern='.RDat')
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')
length(eqtl_derived_files) # 159
length(fusion_files) # 6006
eqtl_derived_files<-gsub('.RDat','',eqtl_derived_files)
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))
genes<-intersect(eqtl_derived_files, fusion_files)
length(genes) # 33
nsnp_res<-NULL
hsq_res<-NULL
cor_res<-list()
for(genes_i in genes){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',genes_i,'.RDat'))
eqtl_derived<-data.frame(wgt.matrix)
if(sum(eqtl_derived$susie) == 0){
eqtl_derived$susie<-NA
}
eqtl_derived$SNP<-row.names(eqtl_derived)
sbayesr<-c(hsq, hsq.pv)
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
fusion<-data.frame(wgt.matrix)
fusion$SNP<-row.names(fusion)
greml<-c(hsq, hsq.pv)
nsnp_res<-rbind(nsnp_res, data.frame(gene_id=genes_i,
nsnp_eqtl_derived=nrow(eqtl_derived),
nsnp_fusion=nrow(fusion)))
hsq_res<-rbind(hsq_res, data.frame(gene_id=genes_i,
sbayesr_h2=sbayesr[1],
sbayesr_se=sbayesr[2],
sbayesr_p=sbayesr[3],
greml_h2=greml[1],
greml_se=greml[2],
greml_p=greml[3]))
both<-merge(eqtl_derived, fusion, by='SNP')
if(nrow(both) > 0){
cor_res[[genes_i]]<-cor(both[,-1], use='p')
}
}
# Compare h2
cor(hsq_res$sbayesr_h2, hsq_res$greml_h2) # 0.7500682
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/h2_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(hsq_res, aes(x=sbayesr_h2, y=greml_h2)) +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_point() +
coord_fixed() +
xlim(0,1) +
ylim(0,1)
dev.off()
# Compare nsnp
cor(nsnp_res$nsnp_eqtl_derived, nsnp_res$nsnp_fusion) # 0.4849533
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/nsnp_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(nsnp_res, aes(x=nsnp_eqtl_derived, y=nsnp_fusion)) +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_point() +
coord_fixed()
dev.off()
# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$Var1<-gsub('top1.x','top1.GTEx',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.x','top1.GTEx',cor_res_melt$Var2)
cor_res_melt$Var1<-gsub('top1.y','top1.FUSION',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.y','top1.FUSION',cor_res_melt$Var2)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
coord_flip()
dev.off()
Show GREML and SBayesR SNP-h2 estimates
Show number of variants in FUSION and eQTL-based models
Show correlation between FUSION and eQTL-based models
The correlation between eQTL derived TWAS weights and FUSION derived TWAS weights is similar to correlation between different FUSION models.
The correlation between eQTL derived and FUSION weights is of interest, but not a good evaluation metric of methods, We should compare the correlation between predicted and observed expression, when using eQTL derived TWAS weights or largest FUSION TWAS weights.
The low correlation between topSNP results from FUSION and GTEx indicates comparison of FUSION and eQTL susmtat derived weights are not very informative.
Here we will predict gene expression levels in the GTEx v8 sample and then test the correlation with observered expression levels. Initially, we will use FUSION SNP-weights derived using the YFS whole blood sample, and eQTL data from eQTLGen whole blood meta-analysis (excl GTEx).
I do not have access to individual level data from GTeX v8, so this project is being carried out in collaboration with Zac Gerring. Here, I will prepare the SNP-weights and the code to predict gene expression levels for Zac. As an example target sample, I will use the EUR subset of the 1KG Phase 3 reference.
I have already written a script to predicted expression levels from TWAS weights called FeaturePred.
Show code
# Download the full cis-eQTL data exluding GTEx
# This was sent privately by Urmo Vosa
mkdir /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx
# Extract relevent columns
zcat /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.txt.gz | cut -f 1-5,9-11,13 | gzip > /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz
library(data.table)
# Read in relevent columns from the sumstats
sumstats<-fread('/users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz')
# Extract data for each gene
genes<-unique(sumstats$ProbeName)
# Read in EUR MAF
frq<-NULL
for(i in 1:22){
frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'
# Process eQTL sumstats for each gene
# For testing use only first 100 genes
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen')
for(i in 1:100){
print(i)
tmp<-sumstats[sumstats$ProbeName == genes[i],]
# Create A1 and A2 columns and update column names
tmp$A1<-gsub('.*/','',tmp$SNPType)
tmp$A2<-gsub('/.*','',tmp$SNPType)
tmp$A2[tmp$A1 != tmp$AlleleAssessed]<-gsub('.*/','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])
tmp$A1[tmp$A1 != tmp$AlleleAssessed]<-gsub('/.*','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])
tmp<-tmp[,c('ProbeName','SNPName','SNPChr','SNPChrPos', 'A1', 'A2','PValue','OverallZScore','SumNumberOfSamples'), with=F]
names(tmp)<-c('Gene','SNP','CHR','BP','A1','A2','P','Z','N')
# Insert FREQ from EUR reference
# There don't seem to be any strand flips
tmp_match<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
tmp_flip<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
tmp_flip$FREQ<-1-tmp_flip$FREQ
tmp<-rbind(tmp_match, tmp_flip)
# Approximate BETA, SE, and P
tmp$P<-2*pnorm(-abs(tmp$Z))
tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
tmp$GENE<-genes[i]
tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
if(i == 1){
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=T, row.names=F, quote=F)
} else {
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=F, row.names=F, quote=F, append=T)
}
}
Show code
# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/twas_weights
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
WGT=paste0('eQTLGen.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)
FUSION only provide the N of the sample and the Z of each SNP for each gene. We can convert this to the required information using reference MAF, though using these approximations is not ideal. This may not be the best comparison since they are finish and there may therefore be MAF and LD mismatch with target sample. Though I assume FUSION developers restricted the analysis to those os EUR ancestry.
Show code
library(data.table)
# Read in the pos file
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')
# Read in EUR MAF
frq<-NULL
for(i in 1:22){
frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'
for(i in 1:nrow(pos)){
print(i)
load(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/',pos$WGT[i]))
tmp<-data.table(cbind(snps, wgt.matrix[,which(colnames(wgt.matrix) == 'top1')]))
names(tmp)<-c('CHR','SNP','POS','BP','A1','A2','Z')
tmp$N<-pos$N[i]
# Insert EUR MAF
tmp_match<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
tmp_flip<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
tmp_flip$FREQ<-1-tmp_flip$FREQ
tmp<-rbind(tmp_match, tmp_flip)
# Approximate BETA, SE, and P
tmp$P<-2*pnorm(-abs(tmp$Z))
tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
tmp$GENE<-pos$ID[i]
tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
if(i == 1){
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=T, row.names=F, quote=F)
} else {
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=F, row.names=F, quote=F, append=T)
}
}
# Make this the input format for eQTL_to_TWAS script.
# Incorporate DENTIST QC of sumstats?
Show code
# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/twas_weights
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL
# Create .pos file
library(data.table)
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL', pattern='.RDat')
# Subset pos file to those with RDat file
pos<-pos[pos$ID %in% gsub('.RDat','',rdat_list),]
pos$PANEL<-'YFS.eQTL'
pos$WGT<-paste0(pos$ID,'.RDat')
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos \
--weights_dir /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGen.eQTL
Show code
library(data.table)
fusion<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
eqtl<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL/FeaturePredictions_YFS.eQTL.txt.gz')
genes<-unique(gsub('\\..*','',gsub('YFS.eQTL.','',names(eqtl)[-1:-2])))
fusion<-fusion[,grepl(paste0('FID|IID|',paste(genes,collapse='|')),names(fusion)), with=F]
both<-merge(fusion, eqtl, by=c('FID','IID'))
cor_res<-list()
for(i in genes){
tmp<-both[,grepl(paste0('\\.',i,'\\.'), names(both)), with=F]
names(tmp)<-gsub('YFS.eQTL.RDat.','eqtl.',gsub('YFS.BLOOD.RNAARR.YFS.','twas.',gsub(paste0(i,'.'),'', names(tmp))))
cor_res[[i]]<-cor(tmp, use='p')
}
# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS_cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
coord_flip()
dev.off()
# Looks as expected, except eQTL.top1 model doesn't correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Thougb since we are filtering by N, this shouldn;t be a problem, and usin the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate. This should be discussed with Sasha.
load('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR/YFS.TRNAU1AP.wgt.RDat')
fusion_wgt<-data.frame(wgt.matrix)
load('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL/TRNAU1AP.RDat')
eqtl_wgt<-data.frame(wgt.matrix)
head(fusion_wgt[rev(order(abs(fusion_wgt$top1))),])
head(eqtl_wgt[rev(order(abs(eqtl_wgt$top1))),])
Show correlation between SNP-weights from each method
Looks as expected, except eQTL.top1 model doesn’t correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Though since we are filtering by N, this shouldn’t be a problem, and using the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate.
This will be run by Zac. I will send him the score files and reference expression, and fusion LD reference data. I don’t know the file paths so I have just put file names.
Show code
mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
# Copy FUSION reference with allele frequency files
cp -r /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF ./
# Copy FUSION .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir FUSION.YFS
cd FUSION.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR ./
# Copy eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir eQTL.YFS
cd eQTL.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./
# Copy pigz binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/pigz ./
# Copy plink binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/plink1.9/plink ./
# Copy FeaturePred.V2.0.R
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R ./
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac.tar.gz for_zac
##########
# Make a new folder containing updated YFS and eQTLGen TWAS weights
##########
mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
# Copy YFS eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.YFS
cd eQTL.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./
# Copy eQTL-Gen eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.eQTLGen
cd eQTL.eQTLGen
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL ./
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac_26042022.tar.gz for_zac_26042022
Show code
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
Rscript FeaturePred.V2.0.R \
--PLINK_prefix_chr LDREF/1000G.EUR. \
--weights FUSION.YFS/YFS.BLOOD.RNAARR.pos \
--weights_dir FUSION.YFS \
--ref_ld_chr LDREF/1000G.EUR. \
--plink ./plink \
--n_cores 1 \
--memory 10000 \
--all_mod T \
--pigz ./pigz \
--ref_maf LDREF/1000G.EUR. \
--output test/FUSION.YFS
Show code
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
/users/k1806347/brc_scratch/Software/Rscript.sh FeaturePred.V2.0.nodel2.R \
--PLINK_prefix_chr LDREF/1000G.EUR. \
--weights eQTL.YFS/YFS.eQTL.pos \
--weights_dir eQTL.YFS \
--ref_ld_chr LDREF/1000G.EUR. \
--plink ./plink \
--n_cores 1 \
--memory 10000 \
--all_mod T \
--pigz ./pigz \
--ref_maf LDREF/1000G.EUR. \
--output test/eQTL.YFS
Zac now sends me the predicted expression data for GTEx v8 so I can compare predicted and observed expression. Observed expression in GTEx is publicly available.
Here I am reading in observed and predicted expression values, testing their correlation, and then summarising the results across gene expression imputation methods. I am using the processed and normalised observed expression in GTEx. First I covary observed expression for covariates used in the original eQTL analysis by GTEx.
Show code
library(data.table)
# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')
# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','external_gene_name'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
obs<-merge(obs, Genes, by.x='gene_id', by.y='ensembl_gene_id_version')
obs<-obs[,c('external_gene_name',names(obs)[grepl('GTEX-', names(obs))]), with=F]
obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))
# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))
# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')
# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/eQTL.YFS.26042022/FeaturePredictions_YFS.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('YFS.','', names(eqtl)))
fusion<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/FUSION.YFS/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
names(fusion)<-gsub('YFS.BLOOD.RNAARR.YFS.','fusion.', names(fusion))
# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
fusion_genes<-unique(gsub('\\..*','',gsub('fusion.','',names(fusion)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)
both_genes<-intersect(both_genes, fusion_genes)
# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)
# Merge eqtl and fusion predicted expression
pred_exp<-merge(eqtl, fusion, by=c('FID','IID'))
# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
# Rename columns to make output consistent across genes and label the observed expression
pred_exp_tmp<-pred_exp[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(pred_exp)), with=F]
names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
# merge predicted and observed expression
both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)
# Calculate correlation
cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}
# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
labs(x='Test', y='Correlation') +
coord_flip()
dev.off()
# Make a pairs plot
library("ggplot2")
library("GGally")
cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")
cor_res_melt_obs_unmelt$negative
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_vline(xintercept= 0, colour='blue') +
geom_hline(yintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
diagfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_density() +
geom_vline(xintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot.png', units='px', width=5000, height=5000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()
# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
n_valid<-rbind(n_valid, data.frame(Method=i,
N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))
write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid.png', row.names=F)
# Read in reported R2 by weights
yfs.profile<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.profile')
Show correlation between observed and predicted expression
| Method | N_valid | median_cor | Freq_valid |
|---|---|---|---|
| fusion.enet | 18 | 0.0508820 | 0.2000000 |
| eQTL.dbslmm | 16 | 0.0430612 | 0.1777778 |
| fusion.bslmm | 16 | 0.0473674 | 0.1777778 |
| fusion.lasso | 16 | 0.0456552 | 0.1777778 |
| eQTL.prscs | 15 | 0.0435965 | 0.1666667 |
| eQTL.top1 | 15 | 0.0279044 | 0.1666667 |
| fusion.top1 | 15 | 0.0394055 | 0.1666667 |
| eQTL.sbayesr_robust | 13 | 0.0394760 | 0.1444444 |
| fusion.blup | 12 | 0.0430674 | 0.1333333 |
| eQTL.sbayesr | 10 | 0.0321927 | 0.1111111 |
| eQTL.susie | 9 | 0.0467886 | 0.1000000 |
The correlations are very low compared the R2 reported in the FUSION profile. This is true when using fusion or eQTL based models. Is this due to poor generalisablity across YFS and GTEx?
Show code
library(data.table)
# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')
# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)
obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))
# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))
# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')
# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/eQTLGen.26042022/FeaturePredictions_eQTLGen.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('eQTLGen.','', names(eqtl)))
# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)
# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)
# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
# Rename columns to make output consistent across genes and label the observed expression
pred_exp_tmp<-eqtl[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(eqtl)), with=F]
names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
# merge predicted and observed expression
both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)
# Calculate correlation
cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}
# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot_eQTLGen.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
labs(x='Test', y='Correlation') +
coord_flip()
dev.off()
# Make a pairs plot
library("ggplot2")
library("GGally")
cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")
cor_res_melt_obs_unmelt$negative
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_vline(xintercept= 0, colour='blue') +
geom_hline(yintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
diagfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_density() +
geom_vline(xintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot_eQTLGen.png', units='px', width=3000, height=3000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()
# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
n_valid<-rbind(n_valid, data.frame(Method=i,
N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))
write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid_eQTLGen.png', row.names=F)
Show correlation between observed and predicted expression
| Method | N_valid | median_cor | Freq_valid |
|---|---|---|---|
| eQTL.sbayesr_robust | 67 | 0.1704210 | 0.8701299 |
| eQTL.top1 | 66 | 0.1817754 | 0.8571429 |
| eQTL.dbslmm | 62 | 0.1494785 | 0.8051948 |
| eQTL.prscs | 53 | 0.1224961 | 0.6883117 |
| eQTL.sbayesr | 43 | 0.1181565 | 0.5584416 |
| eQTL.susie | 36 | 0.1681926 | 0.4675325 |
The predicted-observed correlation is looking better here, with 87% of genes achieving a correlation > 0.1. However, the order of the methods has changed a bit, with top1 and SBayesR (in robust mode) performing best. I am not going to read into the order of methods until we have run across all genes.